tornado_data <- read.csv("./data/1950-2023_actual_tornadoes.csv", na.strings = c("NA", "N/A", " "))

cleaned <- tornado_data |>
  filter(yr >= 2000) |>
  select(-tz, -stf, -ns, -sn, -sg, -f1, -f2, -f3, -f4) |>
  relocate(fc, .after = mag)|>
  drop_na()
  

write.csv(cleaned, "cleaned_df.csv", row.names = FALSE)

1. What are the key factors that contribute to the number of injuries caused by tornadoes?

1. Pruning
# Pruning
best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]


pruned_tree <- prune(tree_model, cp = best_cp)

Pruned tree model

rpart.plot(pruned_tree, type = 3, extra = 101, fallen.leaves = TRUE)

The pruned decision tree identifies magnitude (mag) as the primary factor influencing tornado-related injuries. Tornadoes with mag < 4 predict minimal injuries, averaging 0.2 injuries in most cases, indicating that low-magnitude tornadoes pose minimal risk. For tornadoes with mag >= 4, injuries increase substantially, with additional splits showing that long paths (len >= 75 miles) and high property loss (loss >= 23) are associated with severe injury outcomes. This hierarchy underscores the dominant role of magnitude, with path length and property loss acting as secondary risk factors in extreme scenarios.

2. Cross-Validation
# Cross-Validation
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(
  inj ~ mag + wid + len + loss,
  data = cleaned,
  method = "rpart",
  trControl = train_control,
  tuneLength = 10
)

Cross-validated tree model

rpart.plot(cv_model$finalModel, type = 3, extra = 101, fallen.leaves = TRUE)

The cross-validated tree confirms that magnitude (mag) is the most critical factor, with injuries significantly increasing for mag >= 4. This model captures more nuanced relationships among variables, such as the interplay between path length (len) and property loss (loss). Long paths (len >= 75) consistently predict the highest injury counts, while shorter paths refine the predictions based on magnitude and loss. This suggests that while cross-validation slightly improves predictive accuracy, the variable hierarchy remains consistent with the pruned tree.

Importance

tree_model$variable.importance
##        mag        len       loss        wid 
## 947829.455 605831.823  63904.583   9161.009
cv_model$finalModel$variable.importance
##        mag        len       loss        wid 
## 947829.455 605831.823  63904.583   9161.009

Variable importance rankings in both models reinforce the dominance of magnitude (mag), which has a significantly higher score than the other variables. Path length (len) is the second most influential factor, followed by property loss (loss). Tornado width (wid) has a minor role, suggesting that its contribution to injury predictions is less direct. These rankings highlight the importance of focusing on magnitude and path length when assessing tornado injury risks.

The decision tree visualizations clearly illustrate key thresholds in the data. For mag >= 4, injuries escalate, particularly when combined with long paths (len >= 75) or high property loss (loss >= 23). Tornadoes with mag < 4 result in minimal injuries regardless of other variables. These thresholds provide actionable insights for disaster preparedness, emphasizing the need to monitor and respond aggressively to high-magnitude and long-path tornadoes.

2. How do tornado characteristics predict the occurrence of longer-path tornadoes?

library(pROC)
library(caret)

set.seed(123)

cleaned$long_path_binary <- factor(ifelse(cleaned$len > quantile(cleaned$len, 0.75), 1, 0),
                                   levels = c(0, 1), labels = c("No", "Yes"))

# Set up Cross-Validation

cv_control <- trainControl(
  method = "cv",
  number = 10, 
  classProbs = TRUE, 
  summaryFunction = twoClassSummary
)

# Train Logistic Regression Model with Cross-Validation

cv_logistic_model <- train(
  long_path_binary ~ slat + slon + mag * wid,
  data = cleaned,
  method = "glm",
  family = binomial(),
  metric = "ROC", 
  trControl = cv_control
)

cv_logistic_model
## Generalized Linear Model 
## 
## 29507 samples
##     4 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 26556, 26557, 26556, 26557, 26556, 26556, ... 
## Resampling results:
## 
##   ROC       Sens       Spec     
##   0.835001  0.9543641  0.3930859

The cross-validation step evaluates the model’s ability to predict long-path tornadoes. The 10-fold cross-validation process splits the data into training and testing subsets to ensure that the model generalizes well to unseen data. The ROC value of 0.835 indicates that the model has a good discriminative ability to differentiate between long-path and short-path tornadoes. This is complemented by high sensitivity (0.954) showing that most long-path tornadoes are correctly identified, while specificity (0.393) highlights room for improvement in minimizing false positives.

cv_results <- cv_logistic_model$results
cv_results
##   parameter      ROC      Sens      Spec       ROCSD      SensSD     SpecSD
## 1      none 0.835001 0.9543641 0.3930859 0.006624361 0.004754727 0.01020199
roc_curve <- roc(
  cleaned$long_path_binary,
  predict(cv_logistic_model, newdata = cleaned, type = "prob")[, "Yes"]
)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve with Cross-Validation", col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "gray")

The ROC curve visually confirms the model’s performance, with the curve substantially above the diagonal line (representing random chance). The area under the curve (AUC = 0.835) further reinforces the logistic regression’s capability to classify tornado paths effectively. While sensitivity is strong, specificity should be revisited, perhaps by balancing the dataset or exploring more feature interactions.

library(car)

vif(cv_logistic_model$finalModel)
##      slat      slon       mag       wid `mag:wid` 
##  1.045057  1.055685  2.108353  2.683890  4.213116
summary(cv_logistic_model$finalModel)
## 
## Call:
## NULL
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.9785554  0.2067003  -9.572  < 2e-16 ***
## slat         0.0120309  0.0033344   3.608 0.000308 ***
## slon         0.0085452  0.0020045   4.263 2.02e-05 ***
## mag          0.7974274  0.0306795  25.992  < 2e-16 ***
## wid          0.0054422  0.0001725  31.556  < 2e-16 ***
## `mag:wid`   -0.0006512  0.0001119  -5.820 5.87e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33182  on 29506  degrees of freedom
## Residual deviance: 25463  on 29501  degrees of freedom
## AIC: 25475
## 
## Number of Fisher Scoring iterations: 6

The variable importance analysis highlights significant predictors:

  • Magnitude (mag): Tornado magnitude is the strongest predictor, with higher magnitudes leading to an increased likelihood of long paths.

  • Width (wid): Wider tornadoes significantly contribute to predicting long-path events, consistent with their destructive nature.

  • Latitude (slat): Tornadoes at higher latitudes tend to have longer paths.

  • Longitude (slon): Tornadoes further east also predict longer paths.

  • Interaction term (mag:wid): Indicates that the combination of magnitude and width is vital, possibly reflecting compound effects on destruction.

The variance inflation factor (VIF) values are all below the commonly accepted threshold of 5, suggesting no severe multicollinearity among predictors. This ensures model stability and interpretability.

3. Does the distribution of tornado widths fit an exponential distribution?

library(MASS)
library(ggplot2)

tornado_widths <- cleaned$wid[!is.na(cleaned$wid)]

exp_fit <- fitdistr(tornado_widths, "exponential")

cat("MLE Estimated Rate (Lambda):", exp_fit$estimate, "\n")
## MLE Estimated Rate (Lambda): 0.007025694

This analysis examines whether tornado widths follow an exponential distribution by using Maximum Likelihood Estimation (MLE) to estimate the rate parameter (λ). The estimated rate parameter (λ) is 0.0070, indicating the rapid decline in the density of larger tornado widths. The exponential distribution assumes that events (in this case, tornado widths) decrease in frequency exponentially as their magnitude increases. This aligns with the expectation that narrower tornadoes are much more frequent than wider ones.

4. How do fatalities and loss differ across seasons

To investigate whether there are significant differences in fatalities and injuries across seasons (Winter, Spring, Summer, Fall), a non-parametric method was used here to compare the medians of loss and fatalities across different seasons

Hypotheses Kruskal-Wallis Test

To evaluate the seasonal differences in tornado impacts, Kruskal-Wallis tests were conducted for both loss and fatalities across seasons. The null hypothesis for each test is that the median values of loss and fatalities are the same across all seasons, while the alternative hypothesis states that at least one season has a significantly different median. Using the cleaned dataset, which includes the categorical variable season and the continuous variables loss and fat (fatalities), the test statistics and p-values were calculated to assess the significance of these differences at a 0.05 significance level. This approach allows for a robust comparison of median values across seasons, even with non-normally distributed data.

cleaned=cleaned |> 
  mutate(
    # Create season as a factor based on the month
    season = factor(case_when(
      mo %in% c(12, 1, 2) ~ "Winter",
      mo %in% c(3, 4, 5) ~ "Spring",
      mo %in% c(6, 7, 8) ~ "Summer",
      mo %in% c(9, 10, 11) ~ "Fall"
    ), levels = c("Winter", "Spring", "Summer", "Fall")),
    # Convert mag to a factor
    mag = as.factor(mag)
  )

# Kruskal-Wallis test for loss across seasons
kruskal_loss=kruskal.test(loss ~ season, data = cleaned)
print(kruskal_loss)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  loss by season
## Kruskal-Wallis chi-squared = 1162.5, df = 3, p-value < 2.2e-16
# test for fatalities across seasons
kruskal_fatalities=kruskal.test(fat~ season, data = cleaned)
print(kruskal_fatalities)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  fat by season
## Kruskal-Wallis chi-squared = 146.92, df = 3, p-value < 2.2e-16

The test result shows significant seasonal variations in both economic losses and fatalities. For economic losses, the p-value was less than 0.05, indicating strong evidence to reject the null hypothesis that losses are evenly distributed across all seasons. Similarly, the p-value for fatalities was less than 0.05, also providing strong evidence to reject the null hypothesis and conclude that fatalities vary significantly across seasons.

To further identify specific seasonal differences, post-hoc test were then utilized. The Dunn’s test with Bonferroni adjustment was then conducted

Post-hoc Test Results for Loss Across Seasons

# Post-hoc test for loss across seasons
dunn_loss=dunnTest(loss ~ season, data = cleaned, method = "bonferroni")
print(dunn_loss)
## Dunn (1964) Kruskal-Wallis multiple comparison
##   p-values adjusted with the Bonferroni method.
##        Comparison          Z       P.unadj         P.adj
## 1   Fall - Spring   6.664038  2.664053e-11  1.598432e-10
## 2   Fall - Summer  20.528670  1.193991e-93  7.163947e-93
## 3 Spring - Summer  19.318850  3.728658e-83  2.237195e-82
## 4   Fall - Winter -12.277059  1.202927e-34  7.217560e-34
## 5 Spring - Winter -20.128608  4.144714e-90  2.486828e-89
## 6 Summer - Winter -31.691891 2.009520e-220 1.205712e-219

In the post-hoc Dunn’s test for median based losses, all pairwise comparisons between seasons show significant differences as the p values for them are much smaller than 0.05. And it is exhibited that winter shows significantly higher. Fall has higher losses compared to spring and summer, and winter generally experiences the highest median of losses compared to all other seasons. Given the significant seasonal trends in losses, it is crucial to focus more on Fall and Winter tornadoes when addressing economic impacts even if the tornadoes are not most frequently occurring in these seasons

Post-hoc Test Results for Fatalities Across Seasons

# Post-hoc test for injuries across seasons
dunn_fatalities=dunnTest(fat~ season, data = cleaned, method = "bonferroni")
print(dunn_fatalities)
## Dunn (1964) Kruskal-Wallis multiple comparison
##   p-values adjusted with the Bonferroni method.
##        Comparison           Z      P.unadj        P.adj
## 1   Fall - Spring  -0.5929891 5.531884e-01 1.000000e+00
## 2   Fall - Summer   5.3279154 9.934636e-08 5.960781e-07
## 3 Spring - Summer   7.9963062 1.282073e-15 7.692437e-15
## 4   Fall - Winter  -6.0488587 1.458755e-09 8.752529e-09
## 5 Spring - Winter  -6.5895148 4.412662e-11 2.647597e-10
## 6 Summer - Winter -11.4469857 2.434651e-30 1.460791e-29

Regarding the comparisons of median based fatalities between different seasons, the adjusted p value shows the statistically significant differences for all except the fall and spring, indicating the approximately similar median fatalities between these two seasons. Based on the observations of other comparisons,winter season shows relatively higher fatalities compare to other seasons, followed by summer. Therefore, more preventive measures should be taken during Summer and Winter to address the heightened risks of fatalities.

5. How does the magnitude of events predict the distribution of fatalities?

To examine whether there is an association between the variables mag (magnitude) and fat (fatalities) in the dataset, a Chi-Square test of independence was performed. The Null Hypothesis (H0) states that the magnitude of the tornadoes and fatalities are independent. And the Alternative Hypothesis (H1) states that there is an association between the magnitude of tornado events and numbers of fatalities.

table_data=table(cleaned$mag, cleaned$fat)

# Perform Chi-Square Test
chisq.test(table_data)
## 
##  Pearson's Chi-squared test
## 
## data:  table_data
## X-squared = 25983, df = 162, p-value < 2.2e-16

The result shows that the p-value is far below the standard significance level (e.g., 0.05), so we reject the null hypothesis, indicating that there is a statistically significant association between the magnitude of the event and the number of fatalities. This result statistically confirms the observations initially identified during the exploratory data analysis (EDA), where trends suggested that higher magnitudes are likely to result in more fatalities.

6. Model comparison on predicting fatalities based on event characteristics.

The linear regression model was then developed to analyze the relationship between fatalities (fat) as the dependent variable and multiple predictors: magnitude (mag), injuries (inj), loss (loss), length (len), and width (wid). And diagnostics were conducted to evaluate the assumptions of linear regression, including linearity, normality, homogeneity of variance, and outliers. The check_model() function was used to evaluate four assumptions of linear regression model: Linearity, Normality (QQ Plot), Homogeneity of Variance (Homoscedasticity) and Influential Observations

# Prepare the model specification
lm_spec =linear_reg()|>
  set_mode("regression")|>
  set_engine("lm")

final_model=lm(fat ~ mag + inj + loss + len + wid, data = cleaned)

check_model(final_model, check = c("linearity", "qq", "homogeneity", "outliers"))

Diagnostic Report for Linear Regression Model

  1. Linearity: The residuals display a curved pattern, and the residuals do not scatter randomly around the horizontal line at zero, indicating that the assumption of linearity is violated.
  2. Homoscedasticity: the variance of errors is not constant across all levels of fitted values, so it is also violated
  3. Influential Observations: Several points, particularly those with leverage values exceeding 0.2, are flagged as potentially influential, which requires the further investigation.
  4. Normality check: The QQ plot demonstrates significant deviations from normality at both ends.

Polynomial Regression

With the diagnostic observations from the linear regression model, a more flexible modeling approach was implemented using polynomial terms and interaction effects to better capture the nonlinear relationships and reduce the influence of heteroscedasticity and non linearity.

Model 1 was selected with the following predictors as some of them were tested and observed in the previous analysis:

  • Formula: fat ~ poly(mag, 2) + poly(inj, 2) + poly(loss, 2) + poly(len, 2) + poly(wid, 2), second-order polynomial terms for all predictors (mag, inj, loss, len, and wid) without interaction terms.
# Fit the regression model
model=lm(fat ~ poly(mag, 2) + poly(inj, 2) + poly(loss, 2) + poly(len, 2) + poly(wid, 2), data = cleaned)

# Tidy the model results and print in a neat table
broom::tidy(model) |> 
  knitr::kable(digits = 3, caption = "Regression Model Results with Polynomial Terms")
Regression Model Results with Polynomial Terms
term estimate std.error statistic p.value
(Intercept) 0.060 0.004 13.334 0.000
poly(mag, 2)1 8.860 1.052 8.422 0.000
poly(mag, 2)2 15.400 0.855 18.015 0.000
poly(inj, 2)1 161.696 0.858 188.433 0.000
poly(inj, 2)2 -2.946 0.858 -3.435 0.001
poly(loss, 2)1 -11.555 0.780 -14.815 0.000
poly(loss, 2)2 3.483 0.775 4.495 0.000
poly(len, 2)1 8.696 1.028 8.462 0.000
poly(len, 2)2 27.108 0.857 31.628 0.000
poly(wid, 2)1 1.812 1.029 1.760 0.078
poly(wid, 2)2 -1.252 0.827 -1.515 0.130

The comparison model was used to test if simplifying the model by using fewer polynomial terms and focusing on interactions can improve generalization and further improve the performance of the prediction. So Model 2:

  • Formula: fat ~ poly(mag, 2) + inj:loss + poly(len, 2) + poly(wid, 2), second-order polynomial terms for mag, len, and wid, as well as an interaction term between inj and loss.

Cross Validation

The dataset was split into training and testing sets for cross-validation, with each model trained on the training set and evaluated on the corresponding test set. Prediction accuracy was assessed using the Root Mean Squared Error (RMSE). And the violin plot is used to visualize the rmse of using cross-validation

cv_df <- crossv_mc(cleaned, 100) |> 
  mutate(
    # Convert train and test sets to tibbles
    train = map(train, as_tibble),
    test = map(test, as_tibble)
  )
# Fit models and calculate metrics
cv_results=cv_df |> 
  mutate(
    # Fit models with polynomial terms
     model1_mod = map(train, \(df) lm(fat ~ poly(mag, 2) + poly(inj, 2) + poly(loss, 2) + poly(len, 2) + poly(wid, 2), data = df)),
    model2_mod = map(train, \(df) lm(fat ~ poly(mag, 2) + inj:loss + poly(len, 2) + poly(wid, 2), data = df)),
   
  ) |>
  mutate(
    # Calculate RMSE for each model on the test sets
    rmse_model1 = map2_dbl(model1_mod, test, \(mod, df) rmse(model = mod, data = df)),
    rmse_model2 = map2_dbl(model2_mod, test, \(mod, df) rmse(model = mod, data = df))
  )

# Combine metrics for all models
summary_results=cv_results |> 
  summarise(

    mean_rmse_model1 = mean(rmse_model1),
    mean_rmse_model2 = mean(rmse_model2)
  )

print(summary_results)
## # A tibble: 1 × 2
##   mean_rmse_model1 mean_rmse_model2
##              <dbl>            <dbl>
## 1            0.718             1.14
cv_long=cv_results |> select(starts_with("rmse"))|>
  pivot_longer(
    everything(),
    names_to = "model", 
    values_to = "rmse",
    names_prefix = "rmse_") |> 
  mutate(model = fct_inorder(model))


cv_long|>ggplot(aes(x = model, y = rmse)) + 
   geom_violin()+
  labs(
    title = "Cross-Validated RMSE for fatality",
    x = "Model",
    y = "RMSE"
  ) +
  theme_minimal()

It is observed that the better performance of the first model model1_mod using cross-validation is slightly better compared to the second model model2_mod, with the RMSE increase from about 0.71 to 1.12.